
TABLE 2.1
Evaluating the components of Q-ViT based on the ViT-S backbone. #Bits denotes the weight-activation bit-width (W-A); Top-1 accuracy is reported in %.

Method              #Bits   Top-1   #Bits   Top-1   #Bits   Top-1
Full-precision      32-32   79.9    -       -       -       -
Baseline            4-4     79.7    3-3     77.8    2-2     68.2
+IRM                4-4     80.2    3-3     78.2    2-2     69.9
+DGD                4-4     80.4    3-3     78.5    2-2     70.5
+IRM+DGD (Q-ViT)    4-4     80.9    3-3     79.0    2-2     72.0

2.4 Q-DETR: An Efficient Low-Bit Quantized Detection Transformer

Drawing inspiration from the achievements in natural language processing (NLP), object detection with transformers (DETR) has emerged as a new approach that trains an end-to-end detector with a transformer encoder-decoder [31]. In contrast to earlier methods [201, 153] that heavily rely on convolutional neural networks (CNNs) and necessitate additional post-processing steps such as non-maximum suppression (NMS) and hand-designed sample selection, DETR tackles object detection as a direct set prediction problem.
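To make the set-prediction pipeline concrete, the following is a minimal sketch of a DETR-style forward pass. It is an illustrative assumption of the structure, not the reference implementation: positional encodings and the Hungarian matching loss are omitted, and all module names, shapes, and hyperparameters below are placeholders.

```python
# Minimal DETR-style forward pass (illustrative sketch, not the official code):
# a CNN backbone produces feature tokens, a transformer encoder-decoder attends
# learned object queries to them, and small heads emit class logits and boxes,
# so no NMS post-processing is needed. Positional encodings are omitted.
import torch
import torch.nn as nn
import torchvision


class MiniDETR(nn.Module):
    def __init__(self, num_classes=91, num_queries=100, d_model=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # C5 feature map
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)   # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                   # (cx, cy, w, h)

    def forward(self, images):                          # images: (B, 3, H, W)
        feats = self.input_proj(self.backbone(images))  # (B, d, H/32, W/32)
        tokens = feats.flatten(2).transpose(1, 2)       # (B, H/32 * W/32, d)
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(tokens, queries)          # co-attended object queries
        return self.class_head(hs), self.box_head(hs).sigmoid()  # direct set prediction
```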

Despite this attractiveness, DETR usually has many parameters and floating-point operations (FLOPs). For instance, the DETR model with a ResNet-50 backbone [84] (DETR-R50) has 39.8M parameters, about 159 MB of memory usage, and 86G FLOPs. This leads to unacceptable memory and computation consumption during inference and challenges deployment on devices with limited resources.
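The storage figure follows directly from the parameter count; the short estimate below reproduces it under the assumption of 32-bit floating-point weights and shows the idealized shrinkage from low-bit formats (quantization scales, zero-points, and activation memory are ignored).

```python
# Back-of-the-envelope weight-storage estimate for DETR-R50: 39.8M parameters
# at 32 bits each occupy roughly 159 MB, and an idealized w-bit format shrinks
# this by about 32/w. Overheads such as per-channel scales are ignored.
params = 39.8e6
for bits in (32, 8, 4, 2):
    size_mb = params * bits / 8 / 1e6      # bytes -> MB (decimal)
    print(f"{bits:>2}-bit weights: {size_mb:6.1f} MB")
# 32-bit -> 159.2 MB, 8-bit -> 39.8 MB, 4-bit -> 19.9 MB, 2-bit -> 10.0 MB
```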

Therefore, substantial efforts on network compression have been made toward efficient online inference [264, 260]. Quantization, which represents a network in low-bit formats, is particularly popular for deployment on AI chips. Yet prior post-training quantization (PTQ) for DETR [161] derives the quantized parameters from pre-trained real-valued models, which often confines the model to a sub-optimal state due to the lack of fine-tuning on the training data. In particular, the performance drops drastically when quantizing to ultra-low bits (4 bits or less). Alternatively, quantization-aware training (QAT) [158, 259] performs quantization and fine-tuning on the training dataset simultaneously, leading to only trivial performance degradation even at significantly lower bit-widths. Although QAT methods have proven very effective in compressing CNNs [159, 61] for computer vision tasks, low-bit DETR remains unexplored.
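Since the baseline built in the next paragraph rests on such QAT techniques, the sketch below illustrates their core mechanism in the spirit of LSQ [61]: fake-quantize weights and activations in the forward pass with a learnable step size, and use a straight-through estimator so gradients can fine-tune both. This is an illustrative re-implementation under simplifying assumptions (per-tensor step, no gradient scaling), not the authors' code.

```python
# QAT via fake quantization with a learnable step size (LSQ-style sketch).
# Forward: clamp, round to b-bit integers, dequantize. Backward: the
# straight-through estimator passes gradients through the rounding.
import torch
import torch.nn as nn


class LSQFakeQuant(nn.Module):
    def __init__(self, bits=4, signed=True):
        super().__init__()
        n = 2 ** (bits - 1) if signed else 2 ** bits
        self.qmin, self.qmax = (-n, n - 1) if signed else (0, n - 1)
        self.step = nn.Parameter(torch.tensor(1.0))   # learnable quantization step

    def forward(self, x):
        step = self.step.abs().clamp_min(1e-8)
        q = torch.clamp(x / step, self.qmin, self.qmax)
        q = q + (q.round() - q).detach()              # straight-through estimator
        return q * step                               # dequantize for the next layer


class QuantLinear(nn.Linear):
    """Linear layer whose weights and inputs are fake-quantized during training."""
    def __init__(self, in_features, out_features, bits=4):
        super().__init__(in_features, out_features)
        self.wq = LSQFakeQuant(bits, signed=True)
        self.aq = LSQFakeQuant(bits, signed=True)

    def forward(self, x):
        return nn.functional.linear(self.aq(x), self.wq(self.weight), self.bias)
```

In a QAT baseline of this kind, every linear and attention projection in the detector would be wrapped in such quantized layers and the whole network fine-tuned end-to-end on the training set.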

In this paper, we first build a low-bit DETR baseline, a straightforward solution based on common QAT techniques [61]. Through an empirical study of this baseline, we observe significant performance drops on the VOC [62] dataset. For example, a 4-bit quantized DETR-R50 using LSQ [61] achieves only 76.9% AP50, leaving a 6.4% performance gap compared with the real-valued DETR-R50. We find that the incompatibility of existing QAT methods mainly stems from the unique attention mechanism in DETR, in which spatial dependencies are first constructed between the object queries and the encoded features, and a feed-forward network then maps the co-attended object queries to box coordinates and class labels. A naive application of existing QAT methods to DETR leads to query information distortion, and therefore the performance degrades severely. Figure 2.8 shows an example of information distortion in the query features of 4-bit DETR-R50, where the distribution of the query modules in the quantized DETR deviates significantly from that of the real-valued version. The query information distortion causes inaccurate focus of the spatial attention, which can be verified by following [169] to visualize the spatial attention weight maps of 4-bit and real-valued DETR-R50 in Fig. 2.9. We can see that the quantized DETR-R50 bears